NOTE: To fully understand this lecture, you will need to know HTML and CSS, as well as the pipe operator in R (%>%). Come back to this lecture after reviewing that material.
Web scraping is almost always specific to your own use case, because every website is different and sites are updated and restructured over time. To fully understand web scraping in R, you'll need to understand HTML and CSS so that you know what you are trying to grab off the website.
If you don't know HTML or CSS, you may be able to use an automated web scraping tool such as import.io. Check it out; it will scrape the page automatically and create a CSV file for you.
If you are familiar with HTML and CSS, a very useful library is rvest. We will go over a simple example below, but a good way to see rvest in action is through its built-in demos, which you can list (once the package is installed) with:
demo(package='rvest')
The simple example below comes from RStudio. Start by installing the package:
# Will also install dependencies
install.packages('rvest')
Imagine we'd like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the page with read_html():
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
To extract the rating, we start with SelectorGadget to figure out which CSS selector matches the data we want: strong span. (If you haven't heard of SelectorGadget, make sure to read the rvest vignette on it, vignette("selectorgadget"); it's the easiest way to determine which selector extracts the data that you're interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert the result to a number with as.numeric():
lego_movie %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()
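Because live pages change their markup over time, it can help to try the same pipeline on a small inline HTML string first. The sketch below is only an illustration: the HTML snippet and the names page and rating are hypothetical stand-ins, not IMDB's real markup.

```r
library(rvest)

# Hypothetical stand-in for a movie page (an assumption, not IMDB's real HTML)
page <- read_html('<html><body>
  <div class="rating"><strong><span>7.8</span></strong></div>
  <p><strong><span>another match</span></strong></p>
</body></html>')

rating <- page %>%
  html_node("strong span") %>%  # html_node() keeps only the first match
  html_text() %>%               # extract the text inside the node
  as.numeric()                  # convert the text "7.8" to a number

print(rating)
```

If the selector matched nothing, html_node() would return a missing node and the pipeline would end in NA, so an NA result is usually a sign that the selector needs updating.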
We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:
lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
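To see how html_nodes() differs from html_node(), here is a minimal sketch on an inline HTML string. The markup, the actor names, and the variable names cast_page and cast are made up for illustration; only the selector is taken from the example above.

```r
library(rvest)

# Hypothetical cast markup (an assumption, not IMDB's real HTML)
cast_page <- read_html('<html><body>
  <div id="titleCast">
    <div class="itemprop"><span>Actor One</span></div>
    <div class="itemprop"><span>Actor Two</span></div>
  </div>
  <div class="itemprop"><span>Outside the cast list</span></div>
</body></html>')

cast <- cast_page %>%
  html_nodes("#titleCast .itemprop span") %>%  # every match, not just the first
  html_text()

print(cast)
```

The span outside #titleCast is skipped, because the selector only matches spans nested inside an element with that id.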
The titles and authors of recent message board postings are stored in the third table on the page. We can use html_nodes() and [[ ]] to find it, then coerce it to a data frame with html_table():
lego_movie %>%
  html_nodes("table") %>%
  .[[3]] %>%
  html_table()
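Picking a table by position is fragile, since a site update can add or remove tables. The following is a rough sketch of the same idea with a guard on the number of tables found. The inline HTML, the column names, and the variables tables_page, tables, and posts are all hypothetical.

```r
library(rvest)

# Hypothetical page with three tables (an assumption, not IMDB's real HTML)
tables_page <- read_html('<html><body>
  <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>b</th></tr><tr><td>2</td></tr></table>
  <table>
    <tr><th>title</th><th>author</th></tr>
    <tr><td>First post</td><td>user_a</td></tr>
    <tr><td>Second post</td><td>user_b</td></tr>
  </table>
</body></html>')

tables <- tables_page %>% html_nodes("table")

# Check that a third table exists before indexing into the node set
if (length(tables) < 3) {
  stop("Expected at least 3 tables; the page layout may have changed")
}

posts <- tables[[3]] %>% html_table()  # header row comes from the <th> cells
print(posts)
```

Where the page offers one, selecting the table by an id or class is usually more robust than relying on its position.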
Alright, hopefully this lecture gives you some good resources and ideas in case you want to scrape the web with R in the future!